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1. Introduction 

Lattice QCD is one of the promising method to study the non-perturbative aspect of QCD 
phenomena. The real world contains light up and down quarks with a mass in the raneg 2 — 10 
MeV. Since the most difficult part of the lattice QCD algorithm with light quarks is computing 
the quark propagators and incorporating the quark determinant to include the quark polarization 
effects, direct simulations at the real quark masses for up and down quarks are difficult. 

The algorithm which has been widely used to generate gauge configurations is the hybrid 
Monte Carlo (HMC) algorithm [|l]|. The difficulty in inverting the lattice Dirac opertor needed in 
the algorithm is proportional to the condition number of the operator and increases by a power of 
the smallest eigenvalue of the operator. For Wilson type quarks, since there is no lower bound 
for the Dirac operator, the occurrence of small eigenvalues results in a strong violation of energy 
conservation during the molecular dynamics evolution used in the HMC algorithm. The chiral 
overlap/domain wall fermions have a lower bound in its spectra, but require a large computational 
cost to maintain lattice chirality. Thus the values of quark masses employed in systematic dynami- 
cal QCD simulations have been limited to those around the physical strange quark mass (~50-100 
MeV) until the end of the last century despite the appearance of the tera-flops supercomputers, and 
only staggered-fermions were approaching near the real up-down quark masses. 

The situation has changed in the last few years for dynamical lattice QCD simulations. The 
technique of preconditioning and separatiopn of low eigenmodes applied to the Dirac operator, 
coupled with the Sexton- Weingarten [|2|] multiple time step molecular dynamics integrator has suc- 
ccessfully and drastically accelerated the HMC algorithm. These ideas are rather old [^. However, 
they had to wait recent applications and to reveal their effectiveness at light quark masses. Various 
IR/UV mode separation preconditioners have been developed and investigated. For optimizing the 
MD integrator, techniques for tuning the integrator parameters have been investigated to optimize 
the computational cost. 

The preconditioning is a basic technique to acceleratre linear equation solvers. The difficulty 
to solve linear equations depends on the condition number of the coefficient matrix. The condition 
number can be reduced by using the spectrum information of the coefficient matrix. The deflation 
preconditioning technique, which achieves the reduction by removing the low modes, has been 
investigated and has been known to be a powerful preconditioner [Q]. For the Wilson quark action 
the low mode deflation based on the local coherence assumption significantly improves the solver 
performance [^]. The multigrid solver/preconditioner can be also a powerful preconditioner when 
the effective lattice Dirac operator on coarse grid is adaptively constructed. The adaptive multigrid 
solver, which makes use of the same property of the low mode based on the local coherence, has 
now been applied and found to be effective for the Wilson-Dirac quark [^. 

Since dynamical lattice QCD simulations demand intensive computing resources, a variety of 
supercomputers have been constructed and used. The recent trend in high performance computing 
is the use of CPU with multiple or many cores on a single chip. The bottleneck with multi-core 
CPU may be the bandwidth between off-chip memory and CPU-cores. The multi-core trend seems 
to worsens the gap between CPU peak speed and memory bandwidth. The necessity for paral- 
lelism is also enhanced by the multi-core so that multi-core CPU's require careful programming 
for achieving high sustained performance for QCD applications. Computing with general Purpose 
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GPU (GPGPU) is an alternate trend which has been recently applied to lattice QCD. The GPU card 
architecture now available is based on Single Instruction Multiple Data (SIMD) or Single Instruc- 
tion Multiple Thread (SIMT) and can handle thousand of threads and parallelism with rather high 
memory bandwidth. 

This review is organized as follows. We first briefly describe the recent trends of machiens 
available for lattice QCD simulations, centering on multi/many core CPU and GPGPU. We then 
describe the algorithmic developments for dynamical QCD, including the preconditioning tech- 
nique for the QCD partition function and the improvements on the molecular dynamics integrator. 
We also discuss mixed precision schemes, the deflation and the multigrid techniques for solver 
improvements. 

Covering all topics related to lattice QCD algorithms is not possible in a single review. I wish 
to apologize to those colleagues whose work are not reviewed in the paper. Here I limit myself to 
explain only the HMC algorithm with Wilson-type fermions. 

2. Machine trends 

Historically, high performance supercomputers have been developed for lattice QCD simu- 
lations at each era of the progress of lattice QCD. The development of these computers involved 
elaborate tunings of various parameters such as memory bandwidth, interconnect bandwidth, total 
amount of memory, CPU speed, etc to obtain the best performance and efficiency. Recent trend 
shows, however, that the construction of dedicated machines from the CPU chip level becomes 
difficult due to increasing cost of the chip development. The cluster-type computer built with com- 
modity CPU is an alternative trend in high performance computing (HPC). A guide to build PC 
clusters for lattice QCD has been described in Ref. [^. Ref. [^] reviewed not only the PC clusters 
but also commercial supercomputers and the QCD dedicated machines in detail. New ingredients in 
the commodity/special parts trend during the recent several years are the appearance of multi-core 
CPU, and use of HPC accelerator cards such as ClearSpeed or GPGPU. A detailed benchmark 
of the QCD kernel performance on the recent chip architecture including multi-core and GPGPU 
has been carried out in Ref. [p^]. 

Processor vendors are now producing CPU with two or four cores. There are several archi- 
tectures, and most common one is the Intel architecture. IBM is supplying CPU's with PowerPC 
or Power architecture. SUN is also supplying CPU's with the Sparc architecture. The trend of 
commodity CPU chip is multi-core. The peak performance is easily multiplied by increasing the 
number of cores. The cutting-edge CPU manufacturing process is now around 60-45nm, and 32nm 
process will be available in the near future. The number of cores will be increased keeping the die 
size fixed by shrinking the process pattern. 

The bottleneck of multi-core CPU is the memory bandwidth since progress in the speed of 
memory subsystem is slower than that of the CPU system. To hide the memory latency between the 
core and the off-chip memory an additional memory hierarchy, such as a shared L3 cache memory 
on the chip, is inserted between the off-chip memory and the core exclusive cache memory. The 
QCD applications and algorithms should incorporate this architectural pattern and trends to keep 
high sustained speed. 
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Commodity PC clusters The PC-Cluster type machines based on commodity parts have a better 
price to peak performance ratio. The CPU peak speed now achieves 40-50GFlops with the quad 
core Intel architecture. However, commodity architecture is not always suitable for lattice simula- 
tions due to the low memory and interconnect bandwidth, because the memory bandwidth is still 
limited to ~ 10 GByte/s and a common interconnect is still the Gigabit Ethernet. The compute in- 
tensive part of the lattice QCD simulation is the matrix- vector multiplication of the hopping matrix 
which requires about 3 Byte/Flop in double precision. However, a typical value of the Byte/Flop 
for a common PC with a quad core single CPU running at 2.6GHz with a dual channel DDR2-1066 
memory is lower than 0.4. Moreover the situation becomes worse if the number of core is further 
increased while the memory system is kept the same. A similar situation exists in the network 
bandwidth between compute nodes. The real efficiency of commodity computing is limited by the 
bandwidth, and the total cost for a Cluster machine is determined by the cost of memory band- 
width or the interconnect bandwidth. We can customize the interconnect, or employ Myrinet or 
Infiniband rather than tre Gigabit Ethernet to enhance the network performance, but it is difficult to 
develop a custom-made memory subsystem in view of the price to cost performance. 
The PC Clusters belong to the category of fat-node supercomputers. 



Cell B.E. Sony-IBM-Toshiba Cell broadband Engine contains eight-SPU's and one PowerPC 
PPU on a single die. This also corresponds to a (heterogeneous) multi-core CPU. The first gen- 
eration Cell B.E. can handle only single precision arithmetic at high speed. This year, however, 
IBM released the new generation of Cell B.E. named PowerXCellSi with which each SPU has the 
peak performance 12.8GFlops in full double precision, and a total performance of lOOGFlops. The 
specific feature of Cell B.E. is that each SPU has its own local store (not cache memory) and can 
only access the local store. The data transfer between the local memory and off-chip memory is 
organized by a DMA engine and PPU. The bandwidth between the SPU and local memory is kept 
high, but the local to off-chip memory bandwidth is again limited at 25.6 GByte/sec. The byte per 
flop is 0.25, so an effective usage of local store from algorithmic aspect is required to keep high 
efficiency. Using the DMA/PPU and an improved algorithm with a comprehensive control on data 
flow could reduce the bottleneck. 

Performance evaluations of lattice QCD with Cell B.E. have been reported in Refs. [11, 12, 1^], 
in which the algorithmic and theoretical aspects have been discussed. With the previous Cell B.E., 
a sustained performace of 40GFlops or 20% of peak is achieved for single precision arithmetic. 
A Cell B.E. Cluster named QPACE (QCD PArallel on CEll) is being developed aiming at 
a 200TFlops machine which is based on the new PowerXCellSi CPU's. This is a fat node type 
supercomputer and the interconnect bandwidth can be a bottleneck. To keep a high efficiency for 
lattice QCD they develop a 3D netwrok with a custom-made network card. 



GPGPU To achieve high throughput computing with commodity parts, so-called General Pur- 
pose GPU (GPGPU) computing has been proposed in a couple of years. Modern graphic proces- 
sor units (GPU) can compute realistic 3D graphic images at real time for gaming. Although the 
functions of GPU at early days were limited and fixed for graphics only, GPU makers are now 
developing the functionality for more general computing including more realistic physical motions 
in a virtual world, making use of an increasing availability of transistors per unit area. 
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The Nvidia and AMD/ATI are the main suppher for the high performance GPU cards. These 
companies push GPGPU computing for their GPU products by providing the language and pro- 
gramming model, I.e., CUDA (Compute unified device architecture) JTsI ] for Nvidia and AMD 



stream computing for AMD/ ATI []16[]. Recent GPU chips can execute highly parallel operations 
and threads by Single Instruction Multiple Data (SIMD) or Single Instruction Multiple Threads 
(SIMT) architecture. They accomplished the peak performance exceeding 1 TFLOPS for single 
precision on a single accelerator. The double precision performance is now at around 100 GFlops. 
The bandwidth between GPU and on-board memory is much higher than that of the commodity 
PC's in order to keep the real time graphic computing. The bandwidth is ~ 100 GByte/sec com- 
pared to 16 GByte of dual channel DDR2-1066. The byte/flop is ~ 1 for double precision and ~ 
0. 1 for single precision. The drawback of the GPU computing is less flexibility in the control flow 
or branch instruction due to the SIMD hke hardware structure. 

If this kind of GPU computing becomes common, a compute node can have a peak perfor- 
mance of the order of 1 TFlops. The GPU type cluster also corresponds to a fat node supercom- 
puter. The bottleneck of GPU computing is the bandwidth between host PC and GPU card, which 
is usually connected by an PCI-Express bus, and the node-node interconnect bandwidth. The sus- 
tained bandwidth for PCI-E (2nd generation) is about 2GByte/sec. To keep high efficiency with 
the GPU nodes, we have to use an algorithm which sends a large task to the GPU card thereby 
reducing the host-GPU communications. 



The first GPGPU computing for lattice QCD has been made in Ref. [ ]17| ] using the OpenGL 



(an API for graphic application) programming model. A GPU cluster has also constructed, and 



has been used for finite temperature lattice QCD simulations [18]. The applicability of CUDA for 



lattice simulations has also been reported this year [ p^ pO| ]. Ref. [ pO| ] reports that they achieved 
over 90 GFlops for the even/odd-site preconditioned Wilson-Dirac operator in single precision, and 
about SOGFlops for Wilson CG algorithm on 16^ x T and 32^ x T lattices using a newest Nvidia's 
GPU card GTX280. 

The Nvidia distributes CUDA SDK and compiler, and the language is very similar to the C 
language. The AMD/ATI also distributes Stream SDK which contains CAL (a low level GPU 
language) and Brook-i- (a stream computing language and an extension of C-i~i-). These SDK are 
available from their web pages. It seems that Nvidia has an advantage over AMD/ ATI because 
the newbie instruction and sample programs for CUDA are widely available on the Internet. One 
can easily experiment the CUDA programming with a common PC with Nvidia's graphic card 
(GeForce 8 series and newer). The cutting-edge HPC card of Nvidia named Tesla series, which has 



no graphic output facility and dedicated for HPC computing, is also available []21[]. The AMD/ATI 



is also shipping HPC dedicated cards named Firestream series Jig]. 



0(1) ~ 0(10) Peta Flops machine PFlops-scale machines can be categorized by the size or 
the performance of the single compute node. Thin-node supercomputers, such as BlueGene/L/P 
and QCDOC, are constructed with nodes having a few GFlops of computing performance and a 
balanced memory and interconnect bandwidth. This year the Pet-APE (Peta flops Array Processor 



Experiment) project (successor of the APE series) has been announced [22]. This machine belongs 
the thin-node supercomputer. A 1 PFlops machine can be constructed with 100,000 of compute 
nodes with lOGFlops/node. This requires a uniform fine grained parallelization to achieve the 
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best performance. The programing model experienced with lattice QCD is still applicable to the 
thin-node model. 

The fat-node model is based on the node with multiple CPU's and multi-core CPU. Multi-core 
CPU's at a 100 GFlops level is already available, and higher ones will appear soon. Roadrun- 
ner [0], which belong to the fat-node model, became the first supercomputer to achieve a sus- 



tained speed of over IPFlops [ ]24p in the world. This system is a hybrid supercomputer whose 
node consists of two dual-core Opteron, each attached with an accelerator card containing two 
PoweXCellSi chips. The total of 3,456 nodes with 436GFlops of peak speed for each nodes reach 
1.5PFlops. We expect that machines similar to Roadrunner system become more common in near 
future. We observe that Roadrunner has a multiple hierarchy for the data flow, i.e., CPU to CPU, 
off-chip memory to CPU, core-to-core etc., in addition to the node-to-node interconnect. A careful 
programming and a dedicated algorithm are required on the data flow management for such archi- 
tecture to achieve an efficient performance in this case, such as cache blocking within each core to 
reduce core-to-core data flow in addition to a CPU level data blocking. 

3. Algorithmic developments for dynamical lattice QCD simulations 

The HMC algorithm has been improved over a long time in order to approach physically light 
quark masses, which may be called the deep chiral region. The significant improvement in recent 
years came from the combination of preconditioning to split the quark determinant into an IR and 
UV modes and using the Sexton-Weingarten multiple-time step molecular dynamics (MTSMD) 
integrator to reduce the frequency of Dirac matrix inversions |^. 

3.1 Effective action preconditioning and multiple time scale MD integrator 

Preconditioning is a common technique to solve large-scale linear equations efficiently. Vari- 



ous preconditionings have been used in lattice QCD, such as ILU [|25[], even/odd site, SSOR pq ] 
etc.. This technique can be applied to precondition the quark determinant of the HMC partition 
function and has been used to enhance the performance of the HMC algorthm. 

The lattice QCD partition function contains the quark determinant and the computationally 
dominant part is how to evaluate the quark determinant. The HMC algorithm introduces a fictitious 
gauge momentum into the partition function: 



^nW|det[D]pe"2n'-Ss[f^], (3.1) 



where Sg \U ] is the gauge action, D is the lattice Dirac operator, and P is the canonical momenta of 
the link field U . Here a degenerate Nf = 2 quarks are assumed. The configuration U is generated 
according to the provability weight |det [D] ^ e^^^ -Sg[u]_ jj^g usual HMC algorithm introduces an 
auxiliary field ^ to evaluate the determinant as 

2^ = j QU&U &<^^ &<^e-'^^'-'^^^^^-\^'''^\' = j ^n^[/^0^^0e-^^J^P'^'^l, (3.2) 

where Heff is the effective HMC Hamiltonian. The HMC algorithm carries out microcanocni- 
cal molecular dynamics (MD) evolution with respect to the effective Hamiltonian followed by a 
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Metropolis accept reject test. The most time consuming part is the MD evolution, especially the 
force computation from the quark potential |D^'0p. 

The preconditioning technique amounts to a determinant transformation as 

|det [D]\^ = |det [PD] /det [P] p , (3.3) 

where D is the lattice Dirac operator, and P is a preconditioned Introducing a pseudo fermion fields 
for each determinant, we have 

|det[D]p = y^0t^^^^t^^g-|(PZ))->|'-|PzP = y'^^t^^^^t^^,^-v«[c/,0]-v^v[^^^^ (3_4) 

where Vir[U,<^] = |(PD)"'0|^ and Vuv[U ,<^] = \Pxf'. If we construct P so that DP contains the 
IR modes of D, and P carries the UV modes of D, the HMC effective action is split into two parts 
governed by different physical scales. If, furthermore, the MD force contributions have a hierar- 
chy such that \8Vir/5U\ < \8Vuv /8U\, a multi time step integration of the molecular dynamics 
(MTSMD) is applicable. This minimizes the calculation of the cost intensive part 8Vir/8U by 
assigning a large time step to the momentum update from Vir and a short time step to the Vj/y 
part. The empirical observation that the large fermionic fluctuation in lattice QCD comes from the 
localized/UV mode allows this splitting and assignment. This is the recent strategy to improve the 
HMC algorithm [|]. 

Polynomial preconditioning To solve linear equation Dx = b, a. polynomial of D is a possible 

N I ■ _| 

preconditioner, P = L ;=o' ^j^^ ~ ^ ■ ^^^^ preconditioner can be applied to split UV/IR mode to 
speed up the HMC algorithm along with the strategy described above. 

Hasenbusch's heavy mass preconditioner corresponds to this category in which P = 
{D')^^ with a heavier quark mass operator D' is chosen. With this preconditioner DP contains the 
light modes and DP ~ 1 when the mass difference is small, while P^ = D' contains the heavyer 
modes. Thus the heavy mass parameter works as a UV cut-off. The original aim for this precondi- 
tioner was to extend the MD step size by reducing the MD force norm by the proposed determinant 
separation. Soon after, however, it was realized that the combination with the multi time step MD 



integration leads to an improvement of the HMC algorithm p8| , 29, 30]. By tuning the MTSMD 
step size and the heavy mass of the preconditioner 0(10) improvement has been observed [^]. 
The Hasenbusch's heavy mass preconditioner is applicable to any type of quark action and easy to 
implement. Thus the method has been employed in large scale simulations by many groups. 

In Ref . [ pip a polynomial preconditioner P = Y.j=Q cjD^ ~ D has been applied to the HMC 



algorithm. In this case the order of the polynomial Npoiy corresponds to the UV cut-off instead of 
the heavy mass parameter of the Hasenbusch's preconditioner. 



A partial integration of the UV part is possible via the UV-filtering [32]. This filtering has been 



proposed for the MultiBoson algorithm [33], but it is also applicable to the HMC algorithm [p4|]. 
The UV-filter preconditioner is defined by P = e^^, where M = D — l. For the unimproved Wilson 
quark action M is the hopping matrix (times — K"). Inserting P into the determinant we obtain 

det[D] = det[De-'^]det[e^], 

= det[De-^]e'^''^^^ = det[De-^] . (3.5) 
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where the trace term is zero for the unimproved Wilson quark. The remaining matrix De ^ can be 
expanded in K as 

De-"" = {l+M)e-'' = \- — + - (3.6) 

where the 0{M) term is removed and De^^ has a smaller condition number. Thus the MD force 



norm is reduced and a larger MD step size is possible [[34p. Here we only discussed filtering 



with the 0{M) term. Higher order filtering and combination with even/odd site preconditioning 



is possible [p2|]. The UV-filtering is easy to implement only for the quark action with nearest- 
neighbor coupling; otherwise the trace becomes non-local and difficult to implement. A further 
aggressive approach to remove the UV mode from the determinant is to change the fermion action 
to the UV-suppressed one [^]. 

Geometric preconditioning For the Wilson and KS type fermions, the most common precondi- 
tioner is the even/odd site preconditioning using the nearest neigbour nature of the coupling matrix. 
By ordering the lattice sites so that even-sites come first and odd-sites the last, the Wilson-Dirac 
operator is expressed as the following block 2x2 matrix: 



M. 



D=\ I , (3.7) 

\Moe loo I 

where l^e (loo) is the identity matrix acting on even (odd) sites, andM^o (Moe) is the matrix hopping 
from odd sites to even sites (vice versa). Sandwiching with a left L and a right R preconditioner 
defined by 

L=f^--^-y R={ M, (3.8) 

we can diagonalize D in the block form as 

LDR= I 1 I 11 " . I " I (3 9) 





where D^e = 1 —Me„M„e the Schur complement of D. Using det[R ^] = det[L '] = 1 the quark 
determinant is reduced to 

|det[Z)]p= |det[D,,]p, (3.10) 

The pseudo fermion is introduced to the even-site preconditioned operator Dee- The condition 
number is reduced and a larger MD step size is possible. The even/odd-site preconditioned operator 
Dee has been treated as the basic operator, and further preconditioning, such as Hasenbusch's one, 
has been applied to Dee to enhance the HMC performance. 

The ILU preconditioning with global lexicographic ordering is a more powerful precondi- 



tioner [ ]25| ] than the even/odd preconditioner. ILU with various ordering has been investigated Pq ] 
in the context of linear solver preconditioning. 

Application of the ILU preconditioning to the HMC algorithm is also possible [pi |3^]. By 



ordering lattice sites properly, we can decompose D to \ + U + L with a strictly upper triangular 
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part U and a lower part L. Using the left and right preconditioning, the quark determinant reduces 
to 



det[D] = det[(l+L)"^D(l + det[l+L]det[l + [/] 
= A&t[{\+Ly^D{l + Uy^] = det[D]. 



(3.11) 



The HMC algorithm based on the determinant det[D] is obtained and larger MD step size is ex- 
pected when the ILU preconditioned D has a reduced condition number. For linear solver the 
efficiency of the ILU preconditioning depends on the site ordering and it is known that there is 



a trade-off between the efficiency and parallelism []37|]. Therefore there may exist a similar the 
trade-off between the efficiency and parallelism for the HMC algorithm with ILU preconditioning. 
The geometric preconditioner explained above is based on the site orientation. A domain 



oriented preconditioner is also possible as has been proposed by Liischer []38|]. The lattice sites are 
grouped and colored into two domains, and the domains are blocked and ordered in a checker board 
ordering (even/odd ordering). Based on this domain decomposition, the domain-decomposition 
HMC (DDHMC) algorithm is constructed. The preconditioning follows the same way as in the 
even/odd site preconditioner. The Wilson-Dirac operator can be represented in a similar block 
2x2 form as 

Dee Deo \ 
Doe Doo I ' 



D 



(3.12) 



where Dee {Doo) is the operator within even-domain (odd-domain), and Deo {Doe) is the operator 
connecting from odd to even-domain (vice versa). The determinant reduces to 



detfDl 



det 



Dee Deo 
Doe Doo 

det [Dee] det [Doo] det [\ee - {Dee)'^Deo{Doo)'^Doe] ■ 



(3.13) 



The operator (l^^ — {Dee) ^^Deo{Doo) '^Doe) is the Schur complement by the block even/odd 
preconditioning. This preconditioner introduces a sharp cut-off by the block size at which the IR 
and UV scales are decoupled. The original proposal is that the block extent Lbiock satisfy Ag^^ ~ 
Lbiock from the physical point of view. The Schur complement part represents the physics below the 
confinement scale and the domain-restricted part from the lattice cut-off scale to the confinement 



scale. Introducing pseudo-fermion fields for each determinant of Eq. ( [3.13[ ) we can construct the 
so called domain-decomposed HMC (DDHMC) algorithm. 

For actual implementation of the DDHMC further improvements are achieved by precondi- 



tioning Eq. ( |3.13| ) and by incorporating the dead/alive link method [pSQ. The UV part can be 
preconditioned by the site-oriented even/odd or the ILU preconditioner, and the IR part using the 
spin structure. The link fields connected to each domain and located parallel to the domain surface 
can be kept fixed during the MD evolution. With this procedure the internode communication for 
exchanging the updated link fields can be omitted. This is a semi-local updates algorithm. To 
ensure the ergodicity of MCMC, a random shift of the lattice coordinate origin is carried out after 
each HMC evolution. This method, the dead/alive link method, makes the DDHMC algorithm 
more effective in parallelism and internode bandwidth, but it is applicable only for the quark action 
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with nearest-neighbor coupUngs such as the unimproved/clover Wilson fermion or simple KS- 
fermion actions. The effectiveness of the domain-decomposition factorization strongly depends on 
the structure of the quark action and is maximised for nearest-neighbor coupling action. 

Extensive studies of the DDHMC algorithm have been carried out in Refs. [3^, 4C] which clar- 
ified the effectiveness toward the chiral limit. The efficiency of the DDHMC algorithm generally 
depends on the shape and the size of the domain. The 1 -dimensional striping domain decomposi- 
tion version has been investigated in Ref. pi]]. 



Another Schur complement approach has been discussed in Ref. [|42[], where the RG con- 
sideration and the Schur complement approach are combined to construct a better kernel for the 
domain-wall/overlap fermion actions. The derivation shares the same property that the quark de- 
terminant can be transformed using UV/IR separation. 



Rational approximation and implicit scale splitting Originally the Rational HMC algorithm 
has been proposed to simulate non-local complicated quark actions (this does not mean physically 
non-local) such as the two-flavor KS fermion action for which a fractional power or nth- 

root of the quark determinant is required. The fractional power decomposition has another benefit 
to accelerate the HMC algorithm in that the fractional power reduces the condition number of the 
quark operator with a consequent reduction in the maximum MD force norm as in the case of the 
Hasenbusch's heavy mass preconditioner. Moreover the rational approximation for the fractional 
power quark operator introduces the possibility to enhance the efficiency through a combination 
with the partial fraction form of the rational approximation and the MTSMD integrator. 

Here we explain the method using the two-flavor Wilson quark action as an example. The 
quark determinant can be split in n pieces as 



|det[Z)]|^ = det[D^D] = det[e^] = det[(G^)^]", 



(3.14) 



where Q = y^D and is Hermitian and non-negative. By introducing n-pseudo fermion fields, we 
have 

n 



|det[D]|^ 



(3.15) 



At this point the (maximum) MD force norm is 
estimated as n^u^/" (/x: the condition number of 
Q^) naively. Since the MD force norm without 
the nth-root trick is simply /x, an nji^^"^^ reduc- 
tion of the MD force norm is expected. The op- 
timal n has been estimated as Wopt = (logM)/2 
via the force norm and the computational cost 
estimate. 

The «th-root operator {Q^Y^" can be well 
approximated by the rational approximation: 



n=2CG 
n=3 CG 



JL 



Partial Fraction 



2^l/n 



(3.16) 



Figure 1: The variation of the force magnitude 
(L2 norm), conjugate gradient (CG) cost and 
parameter for each partial fraction with n = 2,3 
pseudo-fermions. Figure taken from Ref. 
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This partial fraction form enables us to split the MD potential into several pieces. The coefficients 
Uj and jSy have the special property, which makes the UV/IR splitting effective, such that the most 
cost intensive part coming from the small shift j3j has a small corresponding residue. This situation 
is illustrated in Figure [I] which shows the hierarchy of the MD force magnitude as a function of 



fraction number [44]. Making use of this UV/IR property the quark potential Sq can be classified 



into several pieces as 

n Nni, n Nr 

5,= ££«^</>;(e'+^^)"V; + I £ 0Ck^jiQ^ + l5k)~'^j, (3.17) 

j=lk=l j=\k=N„ + \ 

where the residue and shift coefficients are ordered to satisfy j3yt < ^k+\- Here we separate modes 
into two groups for illustration; the first group corresponds to the IR part and the last one the UV 
part in this case. The parameter Ncut controls the UV/IR separation and is chosen to minimize 
the total cost. This mode splitting is rather implicit compared to the determinant preconditioner 
method. Combining with the MTSMD integrator the HMC algorithm can be constructed. The re- 
sulting algorithm is called Rational HMC (RHMC) algorithm. To make the RHMC algorithm more 
effective the geometric preconditioning is applied to precede the rational approximation in realistic 
simulations. The approximation theory for matrix functions including the rational approximation 
has been described in Ref. [B^]. For more detailed implementation of the RHMC algorithm see 



Ref. [gSp. 

We have discussed the RHMC algorithm for the Wilson quark kernel for illustration, and 
we did not mention the perfectness of the rational approximation. To make this algorithm exact we 
need to know the spectrum boundary of the kernel operator {Q^ in this example). Given the spec- 
trum interval the error from the approximation is easily controlled to satisfy a required precision to 
make the algorithm exact. This requires the computation of the smallest and largest eigenvalues for 
Q^. In practice, however, to minimize the total computational cost, an accurate rational approxima- 
tion is used only for the Hamiltonian computation and the pseudo-fermion field generation, and less 
accurate approximation is used for the MD evolution p6|]. If no spectrum interval data are avail- 
able for the preconditioned matrix, we need to put a speculative bound and check the exactness or 



correct the error with reweighting or noisy Metropolis methods with controlled manner [|47[]. The 
RHMC algorithm is applicable to any type of quark action and has been used extensively for large 
scale simulations with various quark actions [48, 51 1. 



The rational approximation we have discussed is based on the Hermitian kernel operator Q^. 
Non-Hermitian variants are also possible as in the case of the MultiBoson and the PHMC algo- 
rithms [||, H, H], H]. The non-Hermitian variant would especially more effective than the 



Hermitian version in case of the Wilson quark action. The non-Hermitian rational approximation 
for D^'/^ is obtained via the Cauchy's contour integral representation as the matrix function of D. 
The property of the contour integral representation for a matrix function has been investigated in 



Ref. [56] 



3.2 Molecular dynamics integrator 

On a par with the determinant preconditioning, the use of Sexton- Weingarten multiple-time 
step MD (MTSMD) integrator is the key technique to enhance the performance of the HMC algo- 
rithms. Here we explain the basics of the MTSMD integrator and the recent improvement on the 
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MTSMD integrator including the Omelyan-Mryglod-ForUc scheme [57, 58] and the tuning method 



for the MD integrator |60|] 



The MD integrator used in the HMC algorithms is based on the symplectic integrator with 
exact time-reversal symmetry. The classical trajectory y{t) = {q{t),p{t)) in the phase space of a 
system is governed by the Hamiltonian with H{q,p) = T{p) + 'V{q) and the formal solution of the 
trajectory is written by 

7(0=exp[fLff]7(0), (3.18) 

where the linear operator Lu is defined via the Poisson bracket {,} as = {X,//} with X = 
X{y) a function of y. Although the linear operator Lh can be decomposed as Lh = Lj +Lv, 
the exponential operator cannot be expressed in a simple form because [Lr,Lv] 7^ in general. 
Thus an approximation for the exact evolution operator should be introduced. This affects the 
efficiency of the HMC algorithm through violation of the energy conservation. The MD integrator 
is usually constructed by decomposing the exponential operator exp [?L//] with the simple operators 
Q{t) = exp [rLr] and P{t) = exp [?Ly] . The operation of Q and P is 

Q{t)X{q,p)=X{q + tp,p), P{t)X{q,p)=x{q,p-t^. (3.19) 
For example the second order leapfrog (2LF) integrator is based on the following decomposition: 

e'^" - (y ) 2(50^ (y ) ) "° = U2LF,PQp{t;NMD), (3.20) 

with 8t = t /Nmd and Nmd the MD step number. An alternate version in which the order of Q and 
P is interchanged is also possible. Our task is to find a better integrator with higher accuracy and 
lower computational cost. The error of the integrator can be evaluated by estimating the so called 
shadow Hamiltonian H', which is close to the original H and is exactly conserved along with the 
approximate trajectory. H' can be extracted using the Baker-Campbell-Hausdorff (BCH) formula. 
For the 2LF case we define H' via 

U2LF,PQp{t,NMD)=e^'^', (3.21) 

where we find that H' = H - {{V,{V,T}} + 2{T,{V,T}}) 5f /24 + 0{5t^). The 2LF integrator 
has 5t^ error. 

Higher order integrators can be constructed by nesting the 2LF integrator and tuning the step 
size at each nesting depth to eliminate the 0{8f''^) terms in H' [61, 64]. It has been shown, 



however, that the 2LF integrator is the best in efficiency for dynamical QCD simulations with light 



quarks in large volumes [65|. 



Omelyan et al. [|^] have introduced a new decomposition scheme which minimizes the error 
terms in the shadow Hamiltonian instead of eliminating them. This scheme decomposes a single 
MD step into several steps and introduces tunable parameters to scale the divided time step. For 
example the 2nd order scheme can be further divided to 

e'^- ^ (p e ) P ((1 - 2X)dt) Q P {Xdt)^"' , (3.22) 
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where A is a tunable parameter. The shadow Hamiltonian reads 



H' 



H + [a(A){V, {VJ}} + iS(A){r, {V, r}}] 8t^ + 0{8t') 
6A2-6A + 1 1-6A 



12 



/3(A) 



24 



where 0{5t^) terms still remain. Assuming {V, {V, T}} ~ {7, {V, 7}} ~ 0(1) and minimizing 
A/a(A)2 + /3(A)2, the 0(5?^) error term can be minimized at A,. ~ 0. 193 183 327 This is 



the so called 2nd order minimum norm integrator (2MN) [ ]58[ ] 

This scheme has twice the momentum up- 
date compared to the 2LF scheme and the com- 
putational cost is doubled. However the 2MN in- 
tegrator would have higher acceptance ratio with 
smaller violation of the energy conservation low 
so that there is chance to improve the HMC al- 



0.05 



X 

< 

V 



O position 
X velocity 




Figure 2: {dH^yl^ as a function of A. Simula- 
tions are performed at j3 =5.00 and K = 0.160 
on 4^ lattices with 5t = 0.05. Figure taken from 
Ref. [|5l. 



gorithm. Takaishi and Forcrand [58] have inves- 
tigated the 2MN integrator and 4th-order mini- 
mum norm (4MN) integrator for the HMC algo- 
rithm in lattice QCD. They found that the 2MN 
integrator has a better performance than the 2LF 
integrator by about 50% and the version integrat- 
ing the position first {Q first) has slightly better 
performance than the reversed version. The op- 
timal A can be obtained by measuring {5H^) as 
a function of A since {8H'^) is the metric for the 
HMC acceptance ratio. They found that the optimal one may differ from Af and depends on the 
action and the parameters (Fig. ^. 

The method described above opens a new way to customize the integrator for a given ac- 
tion. A direct optimization on {5H^), however, requires parallel simulations with several A or MD 
schemes, which is rather compute intensive. Clark, Kennedy and Silva [^] proposed a tuning 
scheme for the MD integrator with less computational cost. As shown above any MD integrator 
conserves the corresponding shadow Hamiltonian whose error terms are expressed in terms of the 
Poisson bracket analytically. Measuring the expectation value of the Poisson brackets on equili- 
brated configurations enables us to reconstruct the error terms as a function of tunable parameters 
for any MD integrator (up to desired order in 5t). The tuning strategy they proposed is as follows. 
Defining the error terms as AH = H' — H, the energy difference after the time evoution with t be- 
comes 5H = H{t) — H{0) = AH (0) — AH (t). Thus we expect the optimal acceptance is achieved 
at the minimum of {{5Hf) = {{AH{0)-AH{t)f). Assuming that AH{t) and AH{0) are indepen- 
dent and have the same histogram, we obtain ((5//)^) = 2 Variance (A//) where AH are constructed 
with the Poisson brackets measured on equilibrated configurations. Therefore we expect that the 
optimal parameter minimizing the variance of AH coincides with that minimizing {{5H)^). This is 
a indirect optimization of {{5H)^). They have demonstrated this method with the 2MN integrators 
and found a good agreement for the optimal A between the direct and the indirect methods. 
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The Omelyan (or 2MN) integrator can be used as the building block to construct a recur- 
sive MTSMD integrator. In this case the several tunable parameters (A for ex. 2MN) are intro- 
duced at each depth of the recursion. The motion governed by the Hamiltonian with Np potentials 
H{q,p) = T{p) + lJ=o ' ^;(^)' where the potential terms are ordered as \dVj^i/dq\ > \dVj/dq\, 
can be integrated via 



Ujit-Nj) 
Uo{t;No) 



Nj J ' \2Nj ^ 



{l-2Xj)t 



Q 



Uj-i I — ;A^/-i ]Pi{ — 



-\N, 



2No 



No 



(3.23) 



with^j = {No,Ni, - ■ ■ ,Nj) the timescale parameter and /y(f)X(^,/7) = X(^,/7 — f^Vy/^^) the mo- 
mentum update with j-th potential. This is based on the PQPQP ordered Omelyan integrator. The 
QPQPQ ordered version leads to 



Q 



M,)^°(2A^o) 



(l-2Ao)A / Mgf^i 



No 



(3.24) 



Note that the step number to integrate Uj-i part can be different between the inner and the outer 
Uj-\ in the QPQPQ version since the integration length differs between them and this may further 
reduce the total computational cost. The MTSMD 2MN integrator recursively constructed has been 
used in Refs. [||, || |7, " 



3.3 Solver improvements 

The performance issue of linear solvers is important in lattice QCD simulations. In this section 
we describe the recent solver improvements for the Wilson-type quark actions. The recent topics to 
speed up the solver are categorized in two; (i) use of flexible preconditioners with single precision, 
(ii) use of the spectrum information to construct preconditioners. 

Flexible preconditioner The key for the preconditioner P for solving a linear equation is to 
construct a computationaly inexpensive approximation for D ^ ~ P. Once P is constructed the 
hnear equation Dx = b is preconditioned as DPy = b,x = Py ox PDx = Ph. A naive idea for P is 
the use of polynomial approximation for D ' . However this does not improve the solver in terms 
of the total count of matrix-vector multiplication. 

If P is computed in single precision and if the single precision efficiency is computationally 
higher than that with double precision, the use of polynomial preconditioner has a chance to im- 
prove the linear solver timing. With this motivation double precision (DP) solvers coupled with 
single-precision (SP) preconditioners (so called mixed precision solvers) have been proposed []6^]. 
Since the SP performance is much higher than DP for recent commodity architectures, such as 
IA32 or AMD64/Intel64 architecture, GPGPU, and Cell B.E, the mixed precision solver should be 
one of the choice to improve the algorithm. 
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The solver algorithm with the mixed precision (or flexible) preconditioner is based on the 
Richardson iteration (or iterative refinement) technique [48]. The solver requires "flexibility" 
which means that the preconditioner can be changed from iteration to iteration in the solver al- 
gorithm. This is needed since the precision conversion between SP and DP cannot be expressed 
in a fixed matrix form in DP. The iterative refinement algorithm to solve Dx = b with a flexible 
preconditioner P ~ is given in Alg. [I[ The preconditioner P is computed at the 4th line and 
the accuracy can vary from iteration to iteration. To compute the preconditioner P, any iterative 
solver can be used; this is called an inner-solver. The residual and solution are updated at the 5th 
Une. The accumulated residual and the solution hold the relation r = b — Dx in DP. Any flexible 
preconditioned solver can be constructed via the iterative refinement algorithm. The mixed preci- 



sion solver for lattice QCD has been used in Refs |3^, ^] and a significant speed up has 
been observed. 



Algorithm 1 Richerdson iteration 

1: xis an initial guess, P is an approximate inverse, P ~ (preconditioner), 

2: Calculate initial residual r = b — Dx, 

3: loop 

4: Calculate correction vector 8x = Pr ^ D^r by roughly solving D5x = r. 

5: r = r — Ddx, x = x + 5x, 

6: if 1 1''! I is small enough then 

7: break. 

8: end if 

9: end loop 



Preconditioner with spectrum information The solver performance depends on the spectrum 
and the condition number of the coefficient matrix. If the spectrum information is known the linear 
equation can be preconditioned. 

Let Cp = {co,ci , • • ■ ,Cy,-i} be the /^-dimensional eigen-subspace basis corresponding to the 
small eigenvalues of D. Cp satisfies the partial Schur decomposition as 

DCp — CpTp, CpCp — /p, (3.25) 

where Tp is a strictly upper triangular /)-dimensional matrix and contains eigenvalues as the di- 
agonal elements. There are two methods for preconditioning with the eigen-subspace. One is an 
additive construction and another is a multiplicative one. The additive preconditioner which re- 
moves the subspace from D is called deflation [Q] and can be constructed as follows. Using the 
projection operators P and Q defined by 

P={l-CpCl), Q = {\-CpTp'clD), PD = DQ, (3.26) 

we can remove the near zero eigen-subspace form D. The equation PDx = Pb is easily solved 
due to a reduced condition number and the full solution is obtained by x = Qx + CpT^^C^j. The 



multiplicative preconditioner [ [70| ] to reduce the condition number with spectrum information is 



P = CpT-^Cl. (3.27) 
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This mimics in the subspace and the preconditioned equation PDx = Pb is solved easily again. 

The bottleneck to use the spectrum information as a preconditioner is the construction cost 
of the subspace. The cost reduction is achieved by several independent ways; (i) use the approx- 
imate eigen-subspace to reduce the cost, because exact eigen-subspace is not always required for 
the preconditioner. (ii) reuse/recycle the subspace among multiple linear equations with different 
right-hand vectors, (iii) construct subspace within the Krylov-subspace iterative Unear solver, be- 
cause the eigen-subspace can be constructed from the same Krylov-subspace and the task for the 
linear solver and the eigen solver can be overlapped. These methods are combined with deflation 
technique and a significant improvement has been observed. The details for deflation technique 
and recent developments are described in Ref. [Q]. 

Deflation with low-mode and local coherence The dimension of subspace affects the perfor- 
mance of the deflation preconditioner. Since the density of near zero eigen-modes is proportional 
to the system size V in order to keep the condition number at a lower level we have to increase the 
dimension as the volume. The cost to find an eigenmode is also proportionals to V . This means 
that the total cost is of O(y^). This bottleneck can be circumvented by incorporating the physical 
insight on the near zero eigen-modes of D. 

Liischer has constructed a low-mode deflation subspace via a geometric lattice blocking as- 
suming local coherence property of the low modes [^]. The subspace is constructed as follows; 
(i) supply the low-mode enhanced vectors as (i///(x) ~ {D^^yrji, with random vectors rji, I = 
1, • ■ • ,Ns), where A'^, ~ 12 or 0(10), V ~ 3, and the Dirac operator near critical mass with rough 
approximate inversion, (ii) block Yi (x) as 

V./^(x) = |'^'^")^f"^^' (3.28) 

and then ^if^{x) are orthogonalized to yield cf{x). The subspace is constructed as Cp = {c^(x), A = 
lattice domain block index}. 

The dimension of Cp is /? = Ns x (#of blocks) and is significantly enlarged by the blocking. 
This blocked subspace has a significant benefit that the memory requirement is low compared to 
the full vector basis when compute node parallelism is matched to the domain decomposition. This 
removes the O(V^) bottleneck practically. The low-mode approximation for D leads to 

Ao^(/,A;j,A') = (c^)^Dcf, (3.29) 

where the index A and A' represent the coarse lattice grid index, and / and j (= 1 , • • • ,Ns) mimic the 
internal degree of freedom similar to color-spin. The low-mode Diow still has the nearest-neighbor 
structure on the coarse grid and is easy to invert via iterative solvers. This blocking is very similar 
to real space RG-blocking. 

Liischer constructed the deflation preconditioner from this blocked subspace as 

P=(l-DCp(Aow)"'Cp, Q = {\-Cp{D,,^)-'clD), PD = DQ, (3.30) 

and the solution for Dx = b is obtained with x = Qx + DCp{D\ov/)^^Cj,b by solving PDx = Pb for 
X. A significant improvements and removal of critical slowing down at small quark masses are 
observed not only in solver performance but also in the DDHMC algorithm 
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Multigrid with low-mode The RG-blocking like point of view for linear solvers is called the 
multigrid (MG) solver/preconditioner. The MG iteration is based on the multiplicative precondi- 
tioner as Eq. ( 3.27 ). The extension to multilevel MG is straightforward. In the context of the MG 
algorithm the operators Cjj and Cp correspond to the restriction operator and interpolation (prolon- 
gation) operator respectively. The efficiency of MG depends strongly on the choice of the subspace 
Cp. 

The earlier attempt of the MG approach for lattice field theory was based on the purely ge- 
ometric multigriding combined with the RG blocking for link and fermion fields and mass pa- 
rameters Later the multigrid method has been generalized and arranged to the projective 



multigrid [[72|]. In Refs. | |72| ] the projection operator was constructed from the eigen- vectors of the 
block-restricted lattice Dirac operator D{x,y) :x,y ^ K. With this choice the enhancement of solver 
speed has been achieved at modest quark masses, while the critical slowing down near critical mass 
still remained. 

Recently the adaptive multigrid method has been investigated and a significant speed up and 
removal of critical slowing down have been observed for the U(l) Schwinger model The key in 
this success is the same choice for projection/interpolation operator as Liischer's for the deflation 
subspace. The extension to 4D QCD has been presented this year by Clark et al. [fTOl], where the 
relation between the deflation and adaptive MG methods is clearly explained (like Eqs. ( |3.26| ) and 
( 3.27 )). The removal of critical slowing down is also observed for the QCD case with the adaptive 
MG method. 



4. Summary 

The lattice QCD simulations with dynamical quarks still require huge amount of computa- 
tional resources and development of efficient numerical algorithms. The recent trend in compu- 
tational machines in these years is the many core architecture. The peak performance is rapidly 
growing while the growth ratio of memory bandwidth is rather slow; the gap between the two is 
still the primary obstacle for high performance computing. To fill the gap the memory subsystem 
hierarchy is becoming increasingly complicated through an insertion an intermediate cache system 
or local memory attached to cores. The algorithm and programming close to the hardware imple- 
mentation is important to keep high sustained performance for lattice QCD simulations. GPGPU 
computing for lattice QCD is becoming more familiar by the appearance of the GPGPU-dedicated 
general programming language. Attempts are starting to make a large scale simulation with high 
cost performance, but the bottleneck is again the bandwidth between the GPU card and the host 
computer. The mixed precision/flexible preconditioner technique is a possible solution to remove 
the difficulty by reducing data flow. 

Preconditioning of the QCD partition function and the Sexton-Weingarten multiple-time scale 
MD is now a common and accepted technique for dynamical lattice QCD simulations. In this 
talk I covered several preconditioning techniques. The polynomial preconditioning and the RHMC 
algorithm represent a general methodology independent of the details of quark actions, while the 
geometric one strongly depends on the structure. With the Liischer's DDHMC algorithm, which 
fully makes use of the nearest-neighbor coupling nature of the Wilson quark action, simulations 
near the chiral limit is now possible. The geometric preconditioner dedicated and customized for 
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a given quark action will further enhances the performance of the HMC algorithm for other quark 
actions. For the MD integrator issue, the tuning method via the parametrized Omelyan integrator 
has been developed and is now available. 

The solver performance has been improved via the deflation technique which removes the 
critical slowing down and solver stagnation problems. The local coherence property of the low 
modes further enhances the performance. This year an adaptive multigrid technique has been 
applied for QCD and shown to remove the critical slowing down. The adaptive multigrid also 
makes use of the local coherence property of low modes in a different way. The realization of the 
local coherence deflation and the adaptive multigrid strongly depends on the structure of the quark 
action (nearest-neighbor coupling action) again. 

Combining these improvements the dynamical lattice QCD simulations are now possible close 
to the chiral limit on lattices with the lattice spacing a ~ 0. Ifm and the volume L = 3fm for Wilson 
type and staggered type fermions using 0(10) TFlops supercomputers. The empirical cost per- 



formance formula for the DDHMC algorithm [39] indicates the possibility of the dynamical QCD 



simulations at: (i) rriq ^ 5 MeV, L ~ 6 fm, a ~ 0.1 fm, (ii) ~ 5 MeV, L ~ 2 fm, a ~ 0.03 fm, with 
future 10 Peta flops supercomputers. The first case can handle multiple hadrons in a single box and 
the second one the charm quark physics with dynamical charm. These problems contain one more 
physical scales, (i) pion wave length l/m„ and (ii) charm quark wave length l/m^ in addition to 
1/L and I /a. Numerical simulations for multiple physical scale problems are difficult in general. 
A multiple data blocking is required when the problem is treated with a fat-node supercomputer. 
A matching between the physical scale and the machine structure is recommended for this case. I 
expect that the algorithm with fully utilizing the data blocking structure and physical scale blocking 
structure will cover these problems, for exapmle nesting the domain-decomposed HMC, or nesting 
deflation and multigrid via local coherence. Attempts at these physics have already started by sev- 
eral collaborations using 0(100) TFlops machines by making use of the improvement approaches 
described in the paper. 
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